Load the data

In [1]:
%matplotlib inline 
%config InlineBackend.figure_format='retina'
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
import folium
import seaborn as sns
from pandas_profiling import ProfileReport
import matplotlib as mpl
import scipy
import numpy as np
In [2]:
data = pd.read_csv("../data/aquastat.csv.gzip", compression="gzip")
data.head()
Out[2]:
country region variable variable_full time_period year_measured value
0 Afghanistan World | Asia total_area Total area of the country (1000 ha) 1958-1962 1962.0 65286.0
1 Afghanistan World | Asia total_area Total area of the country (1000 ha) 1963-1967 1967.0 65286.0
2 Afghanistan World | Asia total_area Total area of the country (1000 ha) 1968-1972 1972.0 65286.0
3 Afghanistan World | Asia total_area Total area of the country (1000 ha) 1973-1977 1977.0 65286.0
4 Afghanistan World | Asia total_area Total area of the country (1000 ha) 1978-1982 1982.0 65286.0
In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 143280 entries, 0 to 143279
Data columns (total 7 columns):
country          143280 non-null object
region           143280 non-null object
variable         143280 non-null object
variable_full    143280 non-null object
time_period      143280 non-null object
year_measured    96411 non-null float64
value            96411 non-null float64
dtypes: float64(2), object(5)
memory usage: 7.7+ MB

Univariate analysis

In [4]:
data[["variable", "variable_full"]].drop_duplicates()
Out[4]:
variable variable_full
0 total_area Total area of the country (1000 ha)
576 arable_land Arable land area (1000 ha)
1152 permanent_crop_area Permanent crops area (1000 ha)
1728 cultivated_area Cultivated area (arable land + permanent crops...
2304 percent_cultivated % of total country area cultivated (%)
2880 total_pop Total population (1000 inhab)
3456 rural_pop Rural population (1000 inhab)
4032 urban_pop Urban population (1000 inhab)
4608 gdp Gross Domestic Product (GDP) (current US$)
5184 gdp_per_capita GDP per capita (current US$/inhab)
5760 agg_to_gdp Agriculture, value added to GDP (%)
6336 human_dev_index Human Development Index (HDI) [highest = 1] (-)
6912 gender_inequal_index Gender Inequality Index (GII) [equality = 0; i...
7488 percent_undernourished Prevalence of undernourishment (3-year average...
8064 number_undernourished Number of people undernourished (3-year averag...
8640 avg_annual_rain_depth Long-term average annual precipitation in dept...
9216 avg_annual_rain_vol Long-term average annual precipitation in volu...
9792 national_rainfall_index National Rainfall Index (NRI) (mm/year)
10368 surface_water_produced Surface water produced internally (10^9 m3/year)
10944 groundwater_produced Groundwater produced internally (10^9 m3/year)
11520 surface_groundwater_overlap Overlap between surface water and groundwater ...
12096 irwr Total internal renewable water resources (IRWR...
12672 irwr_per_capita Total internal renewable water resources per c...
13248 surface_entering Surface water: entering the country (total) (1...
13824 surface_inflow_submit_no_treaty Surface water: inflow not submitted to treatie...
14400 surface_inflow_submit_treaty Surface water: inflow submitted to treaties (1...
14976 surface_inflow_secure_treaty Surface water: inflow secured through treaties...
15552 total_flow_border_rivers Surface water: total flow of border rivers (10...
16128 accounted_flow_border_rivers Surface water: accounted flow of border rivers...
16704 accounted_flow Surface water: accounted inflow (10^9 m3/year)
17280 surface_to_other_countries Surface water: leaving the country to other co...
17856 surface_outflow_submit_no_treaty Surface water: outflow to other countries not ...
18432 surface_outflow_submit_treaty Surface water: outflow to other countries subm...
19008 surface_outflow_secure_treaty Surface water: outflow to other countries secu...
19584 surface_total_external_renewable Surface water: total external renewable (10^9 ...
20160 groundwater_entering Groundwater: entering the country (total) (10^...
20736 groundwater_accounted_inflow Groundwater: accounted inflow (10^9 m3/year)
21312 groundwater_to_other_countries Groundwater: leaving the country to other coun...
21888 groundwater_accounted_outflow Groundwater: accounted outflow to other countr...
22464 water_total_external_renewable Water resources: total external renewable (10^...
23040 total_renewable_surface Total renewable surface water (10^9 m3/year)
23616 total_renewable_groundwater Total renewable groundwater (10^9 m3/year)
24192 overlap_surface_groundwater Overlap: between surface water and groundwater...
24768 total_renewable Total renewable water resources (10^9 m3/year)
25344 dependency_ratio Dependency ratio (%)
25920 total_renewable_per_capita Total renewable water resources per capita (m3...
26496 exploitable_regular_renewable_surface Exploitable: regular renewable surface water (...
27072 exploitable_irregular_renewable_surface Exploitable: irregular renewable surface water...
27648 exploitable_total_renewable_surface Exploitable: total renewable surface water (10...
28224 exploitable_regular_renewable_groundwater Exploitable: regular renewable groundwater (10...
28800 exploitable_total Total exploitable water resources (10^9 m3/year)
29376 interannual_variability Interannual variability (WRI) (-)
29952 seasonal_variability Seasonal variability (WRI) (-)
30528 total_dam_capacity Total dam capacity (km3)
31104 dam_capacity_per_capita Dam capacity per capita (m3/inhab)
31680 irrigation_potential Irrigation potential (1000 ha)
32256 flood_occurence Flood occurrence (WRI) (-)
32832 total_pop_access_drinking Total population with access to safe drinking-...
33408 rural_pop_access_drinking Rural population with access to safe drinking-...
33984 urban_pop_access_drinking Urban population with access to safe drinking-...

How many conutries data do we have?

In [5]:
countries = data.country.unique()
data.country.nunique()
Out[5]:
199

How many different time periods of data we have?

In [6]:
data.time_period.nunique()
Out[6]:
12
In [7]:
time_periods = data.time_period.unique()
time_periods
Out[7]:
array(['1958-1962', '1963-1967', '1968-1972', '1973-1977', '1978-1982',
       '1983-1987', '1988-1992', '1993-1997', '1998-2002', '2003-2007',
       '2008-2012', '2013-2017'], dtype=object)

So, we have 12 time periods from 1958 to 2017 with a window of 5 years

How to look at the data

  • All countries during single time period
  • A single country over time
  • All countries over all time periods - typically, the entire data for a given variable
  • All countries that are geograpgically related

All countries during single time period

In [8]:
def time_slice(df, time_period):
    """
    On the given dataframe, 
        1. slice the data in the given time period.
        2. Pivot the dataframe to get all the values of the 'variable' to a column 
           and populate the cell with the data in the 'value' column
        3. Update the column name to the time_period
    """
    
    df = df[df["time_period"] == time_period]
    df = df.pivot(index="country", columns="variable", values="value")
    df.columns.name = time_period
    
    return df
    
In [9]:
time_slice(data, time_periods[0]).head()
Out[9]:
1958-1962 accounted_flow accounted_flow_border_rivers agg_to_gdp arable_land avg_annual_rain_depth avg_annual_rain_vol cultivated_area dam_capacity_per_capita dependency_ratio exploitable_irregular_renewable_surface ... total_flow_border_rivers total_pop total_pop_access_drinking total_renewable total_renewable_groundwater total_renewable_per_capita total_renewable_surface urban_pop urban_pop_access_drinking water_total_external_renewable
country
Afghanistan 19.00 9.0 NaN 7700.0 327.0 213.5000 7760.0 128.40 28.7200 NaN ... 33.4 9344.00 NaN 65.3300 10.650 6992.0 55.68 804.90 NaN 18.18
Albania 3.30 0.0 NaN 436.0 1485.0 42.6900 487.0 NaN 10.9300 NaN ... 0.0 1738.00 NaN 30.2000 6.200 17376.0 26.35 533.20 NaN 3.30
Algeria 0.39 0.0 NaN 6300.0 89.0 212.0000 6900.0 89.99 3.5990 5.0 ... 0.0 11690.00 NaN 11.6700 1.517 998.3 10.15 3934.00 NaN 0.42
Andorra NaN NaN NaN 1.0 NaN 0.4724 1.0 NaN NaN NaN ... NaN 15.38 NaN 0.3156 NaN 20520.0 NaN 9.76 NaN NaN
Angola 0.40 0.0 NaN 2700.0 1010.0 1259.0000 3200.0 25.96 0.2695 NaN ... 0.0 5466.00 NaN 148.4000 58.000 27150.0 145.40 577.00 NaN 0.40

5 rows × 60 columns

A single country over time

In [10]:
def country_slice(df, country):
    df = df[df.country == country]
    df = df.pivot(index="variable", columns="time_period", values="value")
    df.columns.name = country
    
    return df
In [11]:
country_slice(data, countries[0]).head()
Out[11]:
Afghanistan 1958-1962 1963-1967 1968-1972 1973-1977 1978-1982 1983-1987 1988-1992 1993-1997 1998-2002 2003-2007 2008-2012 2013-2017
variable
accounted_flow 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.00 19.00 19.0 19.0
accounted_flow_border_rivers 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.00 9.00 9.0 9.0
agg_to_gdp NaN NaN NaN NaN NaN NaN NaN NaN 38.47 30.62 24.6 22.6
arable_land 7700.0 7844.0 7910.0 7910.0 7910.0 7910.0 7910.0 7685.0 7678.00 7794.00 7790.0 7771.0
avg_annual_rain_depth 327.0 327.0 327.0 327.0 327.0 327.0 327.0 327.0 327.00 327.00 327.0 327.0

All countries over all time periods - typically, the entire data for a given type of variable

In [12]:
def variable_slice(df, variable):
    df = df[df.variable == variable]
    df = df.pivot(index="country", columns="time_period", values="value")
    
    return df
In [13]:
variable_slice(data, 'total_pop').head()
Out[13]:
time_period 1958-1962 1963-1967 1968-1972 1973-1977 1978-1982 1983-1987 1988-1992 1993-1997 1998-2002 2003-2007 2008-2012 2013-2017
country
Afghanistan 9344.00 10369.00 11717.00 13056.00 12667.00 11338.00 13746.0 18034.00 21487.00 25878.00 29727.00 32527.00
Albania 1738.00 1999.00 2254.00 2518.00 2788.00 3121.00 3241.0 3092.00 3123.00 3011.00 2881.00 2897.00
Algeria 11690.00 13354.00 15377.00 17690.00 20576.00 23918.00 27181.0 29888.00 31990.00 34262.00 37439.00 39667.00
Andorra 15.38 20.75 26.89 32.77 39.11 48.46 58.9 64.15 71.05 84.88 79.32 70.47
Angola 5466.00 5963.00 6588.00 7501.00 8808.00 10286.00 11849.0 13802.00 16110.00 19184.00 22686.00 25022.00
In [14]:
data.region.unique()
Out[14]:
array(['World | Asia',
       'Americas | Central America and Caribbean | Central America',
       'Americas | Central America and Caribbean | Greater Antilles',
       'Americas | Central America and Caribbean | Lesser Antilles and Bahamas',
       'Americas | Northern America | Northern America',
       'Americas | Northern America | Mexico',
       'Americas | Southern America | Guyana',
       'Americas | Southern America | Andean',
       'Americas | Southern America | Brazil',
       'Americas | Southern America | Southern America', 'World | Africa',
       'World | Europe', 'World | Oceania'], dtype=object)

Create simpler regions for better analysis

In [15]:
simpler_regions = {
    'World | Asia':'Asia',
    'Americas | Central America and Caribbean | Central America': 'North America',
    'Americas | Central America and Caribbean | Greater Antilles': 'North America',
    'Americas | Central America and Caribbean | Lesser Antilles and Bahamas': 'North America',
    'Americas | Northern America | Northern America': 'North America',
    'Americas | Northern America | Mexico': 'North America',
    'Americas | Southern America | Guyana':'South America',
    'Americas | Southern America | Andean':'South America',
    'Americas | Southern America | Brazil':'South America',
    'Americas | Southern America | Southern America':'South America', 
    'World | Africa':'Africa',
    'World | Europe':'Europe', 
    'World | Oceania':'Oceania'
}
In [16]:
data.region = data.region.apply(lambda x: simpler_regions[x])
In [17]:
data.region.unique()
Out[17]:
array(['Asia', 'North America', 'South America', 'Africa', 'Europe',
       'Oceania'], dtype=object)
In [18]:
def get_subregion(data, region):
    return data[data.region == region]

Missing data

In [19]:
recent = time_slice(data, time_periods[-1])
recent.head()
Out[19]:
2013-2017 accounted_flow accounted_flow_border_rivers agg_to_gdp arable_land avg_annual_rain_depth avg_annual_rain_vol cultivated_area dam_capacity_per_capita dependency_ratio exploitable_irregular_renewable_surface ... total_flow_border_rivers total_pop total_pop_access_drinking total_renewable total_renewable_groundwater total_renewable_per_capita total_renewable_surface urban_pop urban_pop_access_drinking water_total_external_renewable
country
Afghanistan 19.00 9.0 22.6000 7771.0 327.0 213.5000 7910.0 61.76 28.7200 NaN ... 33.4 32527.00 55.3 65.3300 10.650 2008.0 55.68 8547.0 78.2 18.18
Albania 3.30 0.0 22.0500 615.6 1485.0 42.6900 696.0 1391.00 10.9300 NaN ... 0.0 2897.00 95.1 30.2000 6.200 10425.0 26.35 1835.0 94.9 3.30
Algeria 0.39 0.0 13.0500 7469.0 89.0 212.0000 8439.0 209.30 3.5990 NaN ... 0.0 39667.00 83.6 11.6700 1.517 294.2 10.15 28739.0 84.3 0.42
Andorra NaN NaN 0.5239 2.8 NaN 0.4724 2.8 NaN NaN NaN ... NaN 70.47 100.0 0.3156 NaN 4479.0 NaN 68.9 100.0 NaN
Angola 0.40 0.0 NaN 4900.0 1010.0 1259.0000 5190.0 377.50 0.2695 NaN ... 0.0 25022.00 49.0 148.4000 58.000 5931.0 145.40 10052.0 75.4 0.40

5 rows × 60 columns

In [20]:
msno.bar(recent, labels=True)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x2536c136cc8>

This clearly shows that we are not having data for every country at every time period. i.e., we are having imbalanced data

In [21]:
msno.matrix(recent, labels=True)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x2536cd8b848>

We can see the 'exploitable' data is missed for most of the coutries. Does this happen in every time period?

In [22]:
msno.matrix(variable_slice(data, 'exploitable_total'), inline=False, sort='descending')
plt.xlabel("Time Period")
plt.ylabel("Country")
plt.title("Missing total exploitable water resources data across countries and time periods \n \n \n \n")
Out[22]:
Text(0.5, 1.0, 'Missing total exploitable water resources data across countries and time periods \n \n \n \n')

Possible reasons to have such missing data for exploitable resources are

  1. Data hasn't been reported yet
  2. Most coutries have stopped reporting on this factor.
  3. I don't have domain knowledge to understand what's happening.

A column with such few data points doesn't add any value in further analysis. Hence, dropping

In [23]:
data = data.loc[~data.variable.str.contains('exploitable'),:]

As per the recent data, 'national_rainfall_index' has no data at all.

In [24]:
msno.matrix(variable_slice(data, 'national_rainfall_index'), inline=False, sort='descending')
plt.xlabel("Time Period")
plt.ylabel("Country")
plt.title("Missing total national_rainfall_index water resources data across countries and time periods \n \n \n \n")
Out[24]:
Text(0.5, 1.0, 'Missing total national_rainfall_index water resources data across countries and time periods \n \n \n \n')

from 2002, this data hasn't been reported at all. Hence dropping this

In [25]:
data = data.loc[~(data.variable == 'national_rainfall_index')]
In [26]:
north_america = get_subregion(data, 'North America')
msno.nullity_sort(time_slice(north_america, '2013-2017'), sort='descending')
Out[26]:
2013-2017 accounted_flow accounted_flow_border_rivers agg_to_gdp arable_land avg_annual_rain_depth avg_annual_rain_vol cultivated_area dam_capacity_per_capita dependency_ratio flood_occurence ... total_flow_border_rivers total_pop total_pop_access_drinking total_renewable total_renewable_groundwater total_renewable_per_capita total_renewable_surface urban_pop urban_pop_access_drinking water_total_external_renewable
country
Guatemala 18.710 0.000 11.1300 933.8 1996.0 217.3000 1995.0 28.3800 14.630 3.5 ... 0.000 16343.00 92.8 127.900 33.700 7826.0 119.400 8383.00 98.4 18.710
Costa Rica 0.000 0.000 5.6080 232.1 2926.0 149.5000 546.1 415.3000 0.000 3.1 ... 0.000 4808.00 97.8 113.000 37.310 23502.0 113.000 3842.00 99.6 0.000
Dominican Republic 0.000 0.000 6.5740 800.0 1410.0 68.6200 1155.0 218.6000 0.000 3.7 ... 0.000 10528.00 84.7 23.500 4.161 2232.0 23.500 8413.00 85.4 0.000
Panama 2.704 2.704 2.8910 563.0 2928.0 220.8000 748.0 2326.0000 1.941 2.8 ... 5.409 3929.00 94.7 139.300 21.000 35454.0 135.900 2656.00 97.7 2.704
Nicaragua 8.310 0.000 18.8100 1504.0 2280.0 297.2000 1790.0 5263.0000 5.051 3.6 ... 0.000 6082.00 87.0 164.500 59.000 27047.0 160.900 3678.00 99.3 8.310
El Salvador 10.640 0.000 11.2800 750.0 1784.0 37.5400 965.0 633.1000 40.500 3.8 ... 0.000 6127.00 93.8 26.270 6.150 4288.0 22.690 4288.00 97.5 10.640
Honduras 1.504 0.000 13.8700 1020.0 1976.0 222.3000 1475.0 718.9000 1.632 3.6 ... 0.000 8075.00 91.2 92.160 39.000 11413.0 83.070 4610.00 97.4 1.504
Haiti 1.015 0.000 NaN 1070.0 1440.0 39.9600 1350.0 27.7300 7.237 3.9 ... 0.000 10711.00 57.7 14.030 2.157 1310.0 11.870 6219.00 64.9 1.015
Trinidad and Tobago 0.000 0.000 0.4767 25.0 2200.0 11.2900 47.0 52.5700 0.000 2.2 ... 0.000 1360.00 95.1 3.840 0.614 2824.0 3.740 113.70 95.1 0.000
Belize 6.474 0.432 15.5200 78.0 1705.0 39.1600 110.0 338.7000 29.790 2.6 ... 0.864 359.30 99.5 21.730 7.510 60479.0 21.730 152.80 98.9 6.474
Jamaica 0.000 0.000 7.4980 120.0 2051.0 22.5400 215.0 1.9690 0.000 3.5 ... 0.000 2793.00 93.8 10.820 5.472 3874.0 9.111 1541.00 97.5 0.000
United States of America 251.000 0.000 1.3340 154605.0 715.0 7030.0000 157205.0 2287.0000 8.179 3.1 ... 0.000 321774.00 99.2 3069.000 1383.000 9538.0 2913.000 265361.00 99.4 251.000
Mexico 53.320 0.000 3.7060 22993.0 758.0 1489.0000 25670.0 NaN 11.530 2.9 ... 0.000 127017.00 96.1 461.900 150.000 3637.0 402.900 99245.00 97.2 52.880
Canada 52.000 0.000 NaN 46015.0 537.0 5362.0000 50656.0 23414.0000 1.792 1.9 ... 0.000 35940.00 99.8 2902.000 370.000 80746.0 2892.000 29353.00 100.0 52.000
Cuba 0.000 0.000 NaN 3088.0 1335.0 146.7000 3515.0 NaN 0.000 3.0 ... 0.000 11390.00 94.9 38.120 6.480 3347.0 31.640 8670.00 96.4 0.000
Barbados 0.000 0.000 1.4440 11.0 1422.0 0.6115 12.0 NaN 0.000 3.6 ... 0.000 284.20 99.7 0.080 0.074 281.5 0.008 90.48 99.7 0.000
Saint Kitts and Nevis 0.000 0.000 1.3060 5.0 1427.0 0.3710 5.1 NaN 0.000 NaN ... 0.000 55.57 98.3 0.024 0.020 431.9 0.004 17.75 98.3 0.000
Grenada 0.000 0.000 7.8080 3.0 2350.0 0.7990 7.0 0.1873 0.000 NaN ... 0.000 106.80 96.6 0.200 NaN 1873.0 NaN 37.97 99.0 0.000
Antigua and Barbuda 0.000 0.000 2.3720 4.0 1030.0 0.4532 5.0 65.3500 0.000 3.6 ... 0.000 91.82 97.9 0.052 NaN 566.3 NaN 21.83 97.9 0.000
Saint Vincent and the Grenadines 0.000 0.000 7.8300 5.0 1583.0 0.6174 8.0 NaN 0.000 3.6 ... 0.000 109.50 95.1 0.100 NaN 913.2 NaN 55.29 95.1 0.000
Saint Lucia 0.000 0.000 2.9950 3.0 2301.0 1.4270 10.0 NaN 0.000 3.6 ... 0.000 185.00 96.3 0.300 NaN 1622.0 NaN 34.22 99.5 0.000
Bahamas 0.000 0.000 1.7680 8.0 1292.0 17.9300 12.0 NaN 0.000 NaN ... 0.000 388.00 98.4 0.700 NaN 1804.0 NaN 321.20 98.4 0.000
Dominica 0.000 0.000 14.6800 6.0 2083.0 1.5620 23.0 NaN 0.000 3.6 ... 0.000 72.68 NaN 0.200 NaN 2752.0 NaN 50.54 95.7 0.000

23 rows × 54 columns

Looking at North America data

In [27]:
north_america = get_subregion(data, 'North America')
msno.bar(msno.nullity_sort(time_slice(north_america, '2013-2017'), sort='descending').T, inline=False)
plt.title('Fraction of fields complete by country for North America \n \n');

Is there any pattern in the coutries with most missing data? What are the potential reasons for missing data?

In [28]:
# Bahamas location
folium.Map(location=[18.1160128,-77.8364762], tiles="CartoDB positron",  zoom_start=5, width=1200, height=600)
Out[28]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Check what data is missing in Bahamas

In [29]:
msno.nullity_filter(country_slice(data, 'Bahamas').T, filter="bottom", p=0.1)
Out[29]:
variable dam_capacity_per_capita flood_occurence gender_inequal_index groundwater_produced interannual_variability irrigation_potential number_undernourished overlap_surface_groundwater percent_undernourished seasonal_variability surface_groundwater_overlap surface_water_produced total_dam_capacity total_renewable_groundwater total_renewable_surface
Bahamas
1958-1962 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1963-1967 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1968-1972 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1973-1977 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1978-1982 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1983-1987 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1988-1992 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1993-1997 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1998-2002 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2003-2007 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2008-2012 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2013-2017 NaN NaN 0.2979 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [30]:
# JSON with co-ordinates for country boundries
geo_data_path = r'../data/world.json'
In [31]:
null_data = recent['agg_to_gdp'].notnull()*1
map = folium.Map(location=[48, -102], zoom_start=2)
map.choropleth(
    geo_data=geo_data_path,
    data=null_data,
    columns=['country', 'agg_to_gdp'],
    key_on='feature.properties.name', reset=True,
    fill_color='GnBu', fill_opacity=1, line_opacity=0.2,
    legend_name='Missing agricultural contribution to GDP data 2013-2017'
)
map
C:\Users\mebandar\AppData\Local\Continuum\anaconda3\lib\site-packages\folium\folium.py:415: FutureWarning: The choropleth  method has been deprecated. Instead use the new Choropleth class, which has the same arguments. See the example notebook 'GeoJSON_and_choropleth' for how to do this.
  FutureWarning
Out[31]:
Make this Notebook Trusted to load map: File -> Trust Notebook

How countries are providing data over the years

In [32]:
fig, ax = plt.subplots(figsize=(16,16))
sns.heatmap(data.groupby(['time_period', 'variable']).value.count().unstack().T, ax=ax)
plt.xticks(rotation=45);
plt.xlabel('Time period');
plt.ylabel('Variable');
plt.title('Number of countries with data reported for each variable over time');   

Data profiling

In [33]:
# ProfileReport(time_slice(data, '2013-2017'))

Exploring 'population'

Since its a continous variable, we are looking for the below aspects

  1. Location: mean, median, mode, interquartile mean
  2. Spread: standard deviation, variance, range, interquartile range
  3. Shape: skewness, kurtosis
In [34]:
recent[["total_pop", "urban_pop", "rural_pop"]].describe().astype(int)
Out[34]:
2013-2017 total_pop urban_pop rural_pop
count 199 199 199
mean 36890 19849 17040
std 140720 69681 77461
min 0 0 -98
25% 1368 822 500
50% 7595 3967 2404
75% 25088 11656 10677
max 1407306 805387 891112

We can see that the data is skewed.i.e., 50% quartile is closer to either 25% or 75% quartile.
Ex: for total_pop:
50% is 6227 units away from 25% and 17493 units away from 75%.
Also, there is a huge difference of 29295 units between the mean and the 50%.
Clearly the data is skewed.


The min value in rural_pop is negative. What does this imply?

In [35]:
recent.sort_values("rural_pop")[["total_pop", "urban_pop", "rural_pop"]].head()
Out[35]:
2013-2017 total_pop urban_pop rural_pop
country
Qatar 2235.00 2333.00 -98.00
Singapore 5604.00 5619.00 -15.00
Monaco 37.73 38.32 -0.59
Holy See 0.80 0.80 0.00
Nauru 10.22 10.12 0.10

From the glossary of the datasset, we have information that
$ rural pop = total pop - urban pop $

This validates the data we have above.

In [36]:
def time_series(df, country, variable):
    """
    Returns the time series data for a given country & variable
    1. Slice the data with matching country and variable
    2. Drop years with no data
    """
    
    series = df[(df.country==country) & (df.variable==variable)]
    series = series.dropna()[['year_measured', 'value']] 
    
    # Convert years to type int and set as index
    series.year_measured = series.year_measured.astype(int)
    series.set_index('year_measured', inplace=True)
    series.columns = [variable]
    
    return series
    
In [37]:
time_series(data, 'Qatar', 'total_pop').join(time_series(data, 'Qatar', 'urban_pop')).join(time_series(data, 'Qatar', 'rural_pop'))
Out[37]:
total_pop urban_pop rural_pop
year_measured
1962 56.19 48.39 7.80
1967 86.16 75.48 10.68
1972 130.40 115.60 14.80
1977 182.40 162.40 20.00
1982 277.20 248.60 28.60
1987 423.30 385.40 37.90
1992 489.70 459.10 30.60
1997 528.20 506.50 21.70
2002 634.40 608.90 25.50
2007 1179.00 1130.00 49.00
2012 2016.00 2029.00 -13.00
2015 2235.00 2333.00 -98.00

What do with the negative numbers of the rural population? Replace it with rough estimates from other sources?

Skewness and kurtosis

Skewness: Measure of lack of symmetry in the data distribution
Kurtosis: Kurtosis is a statistical measure that defines how heavily the tails of a distribution differ from the tails of a normal distribution.

In [38]:
recent[["total_pop", "urban_pop", "rural_pop"]].apply(scipy.stats.skew)
Out[38]:
2013-2017
total_pop    8.519379
urban_pop    8.545690
rural_pop    9.490029
dtype: float64

skewness for normal distribution should be 0
left skewed is shown by the negative values
right skewed is shown by the positive values

So, here our data is right skewed

In [39]:
recent[["total_pop", "urban_pop", "rural_pop"]].apply(scipy.stats.kurtosis)
Out[39]:
2013-2017
total_pop    76.923725
urban_pop    85.499659
rural_pop    95.838930
dtype: float64
In [40]:
fig, ax = plt.subplots(figsize=(12,8))
ax.hist(recent.total_pop.values, bins=50)
ax.set_xlabel('Total population')
ax.set_ylabel('Number of countries')
ax.set_title('Distribution of population of countries 2013-2017')
Out[40]:
Text(0.5, 1.0, 'Distribution of population of countries 2013-2017')

The data is definetly skewed! How to handle?

  1. Log transformations
  2. Data normalization
In [41]:
# modularize the code foe plotting nulls geospatially
def plot_map(df, variable, time_period=None, log=False, legend_name=None, threshold_scale=None):
    geo_path = r"../data/world.json"
    legend_name = legend_name if legend_name else "%s for %s" %(variable, time_period)
    
    if time_period:
        df = time_slice(df, time_period).reset_index()
    else:
        df = df.reset_index()
    
    if log: 
        df[variable] = df[variable].apply(np.log)
        
    map = folium.Map(location=[35, -45], zoom_start=2, width=800, height=400)
    map.choropleth(geo_data=geo_path, 
                   data=df,
                   columns=['country', variable],
                   key_on='feature.properties.name', reset=True,
                   fill_color='PuBuGn', fill_opacity=0.7, line_opacity=0.2,
                   legend_name=legend_name,
                   threshold_scale=threshold_scale)
    
    return map
In [42]:
plot_map(data, 'total_pop', '2013-2017', legend_name='Total population')
Out[42]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Log transformation

In [43]:
recent[['total_pop']].apply(np.log).apply(scipy.stats.skew)
Out[43]:
2013-2017
total_pop   -0.899063
dtype: float64

The skewness reduced from 8.5 to -0.8. IT definetly reduced. But now we have a little left skew.

In [44]:
recent[['total_pop']].apply(np.log).apply(scipy.stats.kurtosis)
Out[44]:
2013-2017
total_pop    1.086877
dtype: float64

The kurtosis reduced from 76.92 to 1.08

In [45]:
def plot_hist(df, variable, bins=20, xlabel=None, by=None,
              ylabel=None, title=None, logx=False, ax=None):

    if not ax:
        fig, ax = plt.subplots(figsize=(12,8))
    if logx:
        if df[variable].min() <=0:
            df[variable] = df[variable] - df[variable].min() + 1
            print('Warning: data <=0 exists, data transformed by %0.2g before plotting' % (- df[variable].min() + 1))
        
        bins = np.logspace(np.log10(df[variable].min()),
                           np.log10(df[variable].max()), bins)
        ax.set_xscale("log")

    ax.hist(df[variable].dropna().values, bins=bins);
    
    if xlabel:
        ax.set_xlabel(xlabel);
    if ylabel:
        ax.set_ylabel(ylabel);
    if title:
        ax.set_title(title);
    
    return ax
In [46]:
plot_hist(recent, 'total_pop', bins=25, logx=True, 
          xlabel='Log of total population', ylabel='Number of countries',
          title='Distribution of total population of countries 2013-2017');
In [47]:
plot_map(data, 'total_pop', '2013-2017', legend_name='Log of total population', log=True)
Out[47]:
Make this Notebook Trusted to load map: File -> Trust Notebook

We can see larger countries have larger populations - that makes sense... we may hypothesize that water availability may affect countries with higher population density rather than higher absolute populations.

Normalization

In [48]:
recent['population_density'] = recent.total_pop.divide(recent.total_area)
In [49]:
plot_map(recent, 'population_density', legend_name='Population density',threshold_scale=[0,0.3,0.8, 1.8, 78])
Out[49]:
Make this Notebook Trusted to load map: File -> Trust Notebook
In [50]:
# USA population over time
plt.plot(time_series(data, 'United States of America', 'total_pop'));
plt.xlabel('Year');
plt.ylabel('Population');
plt.title('United States population over time');
In [51]:
with sns.color_palette(sns.diverging_palette(220, 280, s=85, l=25, n=23)):
    north_america = time_slice(get_subregion(data, 'North America'), '1958-1962').sort_values('total_pop').index.tolist()
    for country in north_america:
        plt.plot(time_series(data, country, 'total_pop'), label=country);
        plt.xlabel('Year');
        plt.ylabel('Population');
        plt.title('North American populations over time');
    plt.legend(loc=2,prop={'size':5});

This graph gives no insights, other than just the fact that North America is a big country.
So, normaliza again with the min, to see how much each country grows with refernce to its starting population

In [52]:
with sns.color_palette(sns.diverging_palette(220, 280, s=85, l=25, n=23)):
    for country in north_america:
        ts = time_series(data, country, 'total_pop')
        ts['norm_pop'] = ts.total_pop/ts.total_pop.min()*100
        plt.plot(ts['norm_pop'], label=country);
        plt.xlabel('Year');
        plt.ylabel('Percent increase in population');
        plt.title('Percent increase in population from 1960 in North American countries');
    plt.legend(loc=2,prop={'size':5});
In [53]:
north_america_pop = variable_slice(get_subregion(data, 'North America'), 'total_pop')
north_america_norm_pop = north_america_pop.div(north_america_pop.min(axis=1), axis=0)*100
north_america_norm_pop = north_america_norm_pop.loc[north_america]
In [54]:
fig, ax = plt.subplots(figsize=(16, 12));
sns.heatmap(north_america_norm_pop, ax=ax, cmap=sns.light_palette((214, 90, 60), input="husl", as_cmap=True));
plt.xticks(rotation=45);
plt.xlabel('Time period');
plt.ylabel('Country, ordered by population in 1960 (<- greatest to least ->)');
plt.title('Percent increase in population from 1960');

Exploring total renewable water resources

In [55]:
plot_hist(recent, 'total_renewable', bins=50, 
          xlabel='Total renewable water resources ($10^9 m^3/yr$)',
          ylabel='Number of countries', 
          title='Distribution of total renewable water resources, 2013-2017');
In [56]:
plot_hist(recent, 'total_renewable', bins=50, 
          xlabel='Total renewable water resources ($10^9 m^3/yr$)',
          ylabel='Number of countries', logx=True,
          title='Log transformed distribution of total renewable water resources, 2013-2017');
In [57]:
north_america_renew = variable_slice(get_subregion(data, 'North America'), 'total_renewable')
In [58]:
fig, ax = plt.subplots(figsize=(16, 12));
sns.heatmap(north_america_renew, ax=ax, cmap=sns.light_palette((214, 90, 60), input="husl", as_cmap=True));
plt.xticks(rotation=45);
plt.xlabel('Time period');
plt.ylabel('Country, ordered by population in 1960 (<- greatest to least ->)');
plt.title('Percent increase in population from 1960');

Total renewable resources doesn't seem to change over time. But, better to check

In [59]:
north_america_renew.head()
Out[59]:
time_period 1958-1962 1963-1967 1968-1972 1973-1977 1978-1982 1983-1987 1988-1992 1993-1997 1998-2002 2003-2007 2008-2012 2013-2017
country
Antigua and Barbuda 0.052 0.052 0.052 0.052 0.052 0.052 0.052 0.052 0.052 0.052 0.052 0.052
Bahamas 0.700 0.700 0.700 0.700 0.700 0.700 0.700 0.700 0.700 0.700 0.700 0.700
Barbados 0.080 0.080 0.080 0.080 0.080 0.080 0.080 0.080 0.080 0.080 0.080 0.080
Belize 21.730 21.730 21.730 21.730 21.730 21.730 21.730 21.730 21.730 21.730 21.730 21.730
Canada 2902.000 2902.000 2902.000 2902.000 2902.000 2902.000 2902.000 2902.000 2902.000 2902.000 2902.000 2902.000

subtract 1958-1962 values from each period and add up the results

In [60]:
north_america_renew.sub(north_america_renew.iloc[:,0], axis=0).sum()
Out[60]:
time_period
1958-1962    0.0
1963-1967    0.0
1968-1972    0.0
1973-1977    0.0
1978-1982    0.0
1983-1987    0.0
1988-1992    0.0
1993-1997    0.0
1998-2002    0.0
2003-2007    0.0
2008-2012    0.0
2013-2017    0.0
dtype: float64

Does this apply to the rest of the world?

In [61]:
renew = variable_slice(data, 'total_renewable')
renew.sub(renew.iloc[:,0], axis=0).sum()
Out[61]:
time_period
1958-1962     0.0
1963-1967     0.0
1968-1972     0.0
1973-1977     0.0
1978-1982     0.0
1983-1987     0.0
1988-1992     0.0
1993-1997   -14.0
1998-2002   -14.0
2003-2007   -17.0
2008-2012   -17.0
2013-2017   -17.0
dtype: float64

NO!!!! Look at country

In [62]:
renew.sub(renew.iloc[:,0], axis=0).sum(axis=1).sort_values().head()
Out[62]:
country
Bhutan        -79.0
Afghanistan     0.0
Nigeria         0.0
Niue            0.0
Norway          0.0
dtype: float64

Bhutan changed

In [63]:
renew.sub(renew.iloc[:,0], axis=0).sum(axis=1).sort_values().tail(50)
Out[63]:
country
Djibouti                            0.0
Dominican Republic                  0.0
Hungary                             0.0
Iceland                             0.0
India                               0.0
Indonesia                           0.0
Iran (Islamic Republic of)          0.0
Iraq                                0.0
Ireland                             0.0
Israel                              0.0
Italy                               0.0
Jamaica                             0.0
Japan                               0.0
Jordan                              0.0
Kazakhstan                          0.0
Kenya                               0.0
Kiribati                            0.0
Kuwait                              0.0
Kyrgyzstan                          0.0
Lao People's Democratic Republic    0.0
Latvia                              0.0
Lebanon                             0.0
Lesotho                             0.0
Honduras                            0.0
Dominica                            0.0
Holy See                            0.0
Guyana                              0.0
Ecuador                             0.0
Egypt                               0.0
El Salvador                         0.0
Equatorial Guinea                   0.0
Eritrea                             0.0
Estonia                             0.0
Ethiopia                            0.0
Faroe Islands                       0.0
Fiji                                0.0
Finland                             0.0
France                              0.0
Gabon                               0.0
Gambia                              0.0
Georgia                             0.0
Germany                             0.0
Ghana                               0.0
Greece                              0.0
Grenada                             0.0
Guatemala                           0.0
Guinea                              0.0
Guinea-Bissau                       0.0
Haiti                               0.0
Zimbabwe                            0.0
dtype: float64

Assess the relationship between each variable and the target

Assess the relationships by

  • Linearity
  • Direction
  • Rough size
  • Strength

All our data is of continous type. Hence we need bivariate plotting for continous * continous. The below are the possible plots to consider

  • Scatter plots
  • Correlation matrix heatmap
  • Hexibin plots
  • Joint KDE(Kernel Density Estimation) plots
In [64]:
plt.scatter(recent.seasonal_variability, recent.gdp_per_capita)
plt.xlabel('Seasonal variability');
plt.ylabel('GDP per capita ($USD/person)');
In [65]:
def plot_scatter(df, x, y, xlabel=None, ylabel=None, title=None,
                 logx=False, logy=False, by=None, ax=None):
    if not ax:
        fig, ax = plt.subplots(figsize=(12, 10))

    colors = mpl.rcParams['axes.prop_cycle'].by_key()['color']
    if by:
        groups = df.groupby(by)
        for j, (name, group) in enumerate(groups):
            ax.scatter(group[x], group[y], color=colors[j], label=name)
        ax.legend()
    else:
        ax.scatter(df[x], df[y], color=colors[0])
    if logx:
        ax.set_xscale('log')
    if logy:
        ax.set_yscale('log')

    ax.set_xlabel(xlabel if xlabel else x);
    ax.set_ylabel(ylabel if ylabel else y);
    if title:
        ax.set_title(title);
    return ax
In [66]:
svr = [recent.seasonal_variability.min(), recent.seasonal_variability.max()]
gdpr = [(recent.gdp_per_capita.min()), recent.gdp_per_capita.max()] 
gdpbins = np.logspace(*np.log10(gdpr), 25)
In [67]:
g =sns.JointGrid(x="seasonal_variability", y="gdp_per_capita", data=recent, ylim=gdpr)
g.ax_marg_x.hist(recent.seasonal_variability, range=svr)
g.ax_marg_y.hist(recent.gdp_per_capita, range=gdpr, bins=gdpbins, orientation="horizontal")
g.plot_joint(plt.hexbin, gridsize=25)
ax = g.ax_joint
# ax.set_yscale('log')
g.fig.set_figheight(8)
g.fig.set_figwidth(9)
C:\Users\mebandar\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\histograms.py:824: RuntimeWarning: invalid value encountered in greater_equal
  keep = (tmp_a >= first_edge)
C:\Users\mebandar\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\histograms.py:825: RuntimeWarning: invalid value encountered in less_equal
  keep &= (tmp_a <= last_edge)
In [68]:
g =sns.JointGrid(x="seasonal_variability", y="gdp_per_capita", data=recent, ylim=gdpr)
g.ax_marg_x.hist(recent.seasonal_variability, range=svr)
g.ax_marg_y.hist(recent.gdp_per_capita, range=gdpr, bins=gdpbins, orientation="horizontal")
g.plot_joint(plt.scatter)
ax = g.ax_joint
ax.set_yscale('log')
g.fig.set_figheight(8)
g.fig.set_figwidth(9)

Correlation

  • Correlation is used to measure the strength of a linear relationships.
In [69]:
recent_corr = recent.corr().loc['gdp_per_capita'].drop(['gdp','gdp_per_capita'])
In [70]:
def conditional_bar(series, bar_colors=None, color_labels=None, figsize=(13,24),
                   xlabel=None, by=None, ylabel=None, title=None):
    fig, ax  = plt.subplots(figsize=figsize)
    if not bar_colors:
        bar_colors = mpl.rcParams['axes.prop_cycle'].by_key()['color'][0]
    plt.barh(range(len(series)),series.values, color=bar_colors)
    plt.xlabel('' if not xlabel else xlabel);
    plt.ylabel('' if not ylabel else ylabel)
    plt.yticks(range(len(series)), series.index.tolist())
    plt.title('' if not title else title);
    plt.ylim([-1,len(series)]);
    if color_labels:
        for col, lab in color_labels.items():
            plt.plot([], linestyle='',marker='s',c=col, label= lab);
        lines, labels = ax.get_legend_handles_labels();
        ax.legend(lines[-len(color_labels.keys()):], labels[-len(color_labels.keys()):], loc='upper right');
    plt.close()
    return fig
In [71]:
bar_colors = ['#0055A7' if x else '#2C3E4F' for x in list(recent_corr.values < 0)]
color_labels = {'#0055A7':'Negative correlation', '#2C3E4F':'Positive correlation'}

conditional_bar(recent_corr.apply(np.abs), bar_colors, color_labels,
               title='Magnitude of correlation with GDP per capita, 2013-2017',
               xlabel='|Correlation|')
C:\Users\mebandar\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:1: RuntimeWarning: invalid value encountered in less
  """Entry point for launching an IPython kernel.
Out[71]:

For non-linear relationships, we can try

  • Assess the correlations between the lod transformed data. This might still have non-linearity
  • Bin variables into categories and look at the distribution of other variables for each category.
In [72]:
plot_hist(recent, 'gdp_per_capita', xlabel='GDP per capita ($)', 
         ylabel='Number of countries', 
          title='Distribution of GDP per capita, 2013-2017');
In [73]:
plot_hist(recent, 'gdp_per_capita', xlabel='GDP per capita ($)', logx=True, 
         ylabel='Number of countries', bins=25,
          title='Log Transformed Distribution of log GDP per capita, 2013-2017');
In [74]:
capita_bins = ['Very low', 'Low', 'Medium', 'High', 'Very high']
recent['gdp_bin'] = pd.qcut(recent.gdp_per_capita, 5, capita_bins)
bin_ranges = pd.qcut(recent.gdp_per_capita, 5).unique()
In [81]:
def plot_hist_bins(df, variable, bins=None, xlabel=None, by=None,
              ylabel=None, title=None, logx=False, ax=None):
    if not ax:
        fig, ax = plt.subplots(figsize=(12, 8))
    if logx:
        bins = np.logspace(np.log10(df[variable].min()),
                           np.log10(df[variable].max()), bins)
        ax.set_xscale("log")

    if by:
        if type(df[by].unique()) == pd.core.arrays.categorical.Categorical:
            cats = df[by].unique().categories.tolist()
        else:
            cats = df[by].unique().tolist()

        for cat in cats:
            to_plot = df[df[by] == cat][variable].dropna()
            ax.hist(to_plot, bins=bins);
    else:
        ax.hist(df[variable].dropna().values, bins=bins);

    if xlabel:
        ax.set_xlabel(xlabel);
    if ylabel:
        ax.set_ylabel(ylabel);
    if title:
        ax.set_title(title);

    return ax
In [82]:
plot_hist_bins(recent, 'gdp_per_capita', xlabel='GDP per capita ($)', logx=True, 
         ylabel='Number of countries', bins=25, by='gdp_bin',
          title='Distribution of log GDP per capita, 2013-2017')
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x253068ed208>

Now that we have a CATEGORICAL X CONTINUOUS analysis, we can look at the distribution of a few variables for each gdp group.

In [83]:
recent[['gdp_bin','total_pop_access_drinking']].boxplot(by='gdp_bin');
# plt.ylim([0,100000]);
plt.title('Distribution of percent of total population with access to drinking water across gdp per capita categories');
plt.xlabel('GDP per capita quintile');
plt.ylabel('Total population of country');
In [84]:
def mult_boxplots(df, variable, category, 
                  xlabel=None, ylabel=None, title=None,
                  ylim=None):
    df[[variable, category]].boxplot(by=category);
    
    if xlabel:
        plt.xlabel(xlabel);
    if ylabel:
        plt.ylabel(ylabel);
    if title:
        plt.title(title);
    if ylim:
        plt.ylim(ylim);
In [86]:
mult_boxplots(recent, 'flood_occurence', 'gdp_bin', xlabel='GDP per capita quintile')

Since we have lot of variables, we can rank them by f-value

In [87]:
cat = 'gdp_bin'
cat_no_bin = 'gdp_per_capita'
fps = []
for var in recent.columns.tolist():
    if var != cat and var != cat_no_bin:
        gb = recent[[var, cat]].dropna().groupby(cat)
        f, p = scipy.stats.f_oneway(*gb[var].apply(list).values.tolist())
        fps.append([var, f, p])    

fps = pd.DataFrame(fps, 
                   columns=['variable','f','p']).sort_values('f', 
                                                             ascending=False)
C:\Users\mebandar\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\stats.py:3333: RuntimeWarning: Mean of empty slice.
  offset = alldata.mean()
C:\Users\mebandar\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\_methods.py:85: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
C:\Users\mebandar\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\stats.py:3336: RuntimeWarning: invalid value encountered in double_scalars
  sstot = _sum_of_squares(alldata) - (_square_of_sums(alldata) / bign)
C:\Users\mebandar\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\stats.py:3339: RuntimeWarning: invalid value encountered in double_scalars
  ssbn += _square_of_sums(a - offset) / len(a)
C:\Users\mebandar\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\stats\stats.py:3343: RuntimeWarning: invalid value encountered in double_scalars
  ssbn -= _square_of_sums(alldata) / bign
In [88]:
fps.head()
Out[88]:
variable f p
22 human_dev_index 234.618336 4.952986e-70
2 agg_to_gdp 94.414789 2.692111e-41
16 gender_inequal_index 61.450716 1.183911e-30
51 total_pop_access_drinking 39.077823 2.766087e-23
34 rural_pop_access_drinking 36.193797 6.254604e-22
In [89]:
n = 10
for var in fps.variable.tolist()[:n]:
    mult_boxplots(recent, var, cat)
In [90]:
corr = recent.corr()
In [91]:
fig, ax = plt.subplots(figsize=(14,12));
sns.heatmap(corr, ax=ax);
plt.xlabel('');
plt.ylabel('');
plt.title('Correlation matrix heatmap');

Conclusion:

The GDP per capita is highly impacted by the

  • gender inequal index
  • seasonal variablity

But we can see a noticalble affect of the below features

  • flood occurance
  • avg annual rain depth
  • ground water accounted inflow
  • ground water to other countries
  • percent cultivated
  • percent under nourished
  • surface ground water overlap
  • total dam capacity

Hence we cannot ignore the affect of water availability and use on the GDP per capita

In [ ]: